INTERSPEECH.2004 - Analysis and Assessment

Total: 81

#1 Biomechanical parameter fingerprint in the mucosal wave power spectral density

Authors: Juan-Ignacio Godino-Llorente ; Victoria Rodellar-Biarge ; Pedro Gomez-Vilda ; Francisco Diaz-Perez ; Agustin Alvarez-Marquina ; Rafael Martinez-Olalla

The importance of mucosal wave detection and estimation has been stressed in the literature on the automatic classification and recognition of larynx pathologies from voice records. Using a new estimation method for the mucosal wave correlate and simulation results from a 2-mass model of the vocal folds, the present paper shows that the main fingerprints found in the power spectral density of the mucosal wave correlate are directly related to the biomechanics of the system. These findings open the door to the non-invasive estimation of the biomechanical parameters of the vocal folds directly from voice records, thus easing the task of automatic pathology classification and recognition.
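
The claimed link between biomechanics and spectral fingerprints can be illustrated with the resonance of a single lumped mass-spring element, a drastic simplification of the 2-mass model; the values below are illustrative and not taken from the paper:

```python
import math

def resonance_hz(k, m):
    """Natural frequency of one mass-spring element of a lumped
    vocal-fold model: f = sqrt(k/m) / (2*pi).  Stiffness k and mass m
    are the kind of biomechanical parameters whose changes shift
    spectral peaks."""
    return math.sqrt(k / m) / (2.0 * math.pi)

m = 1e-4                              # effective mass (kg), illustrative
k = (2.0 * math.pi * 100.0) ** 2 * m  # stiffness chosen to give 100 Hz
print(round(resonance_hz(k, m)))      # 100
```

Raising the stiffness k (e.g. through increased muscle tension) shifts the resonance upward, which is the kind of biomechanical fingerprint a power spectral density can reveal.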

#2 Classification of pathological voice including severely noisy cases

Authors: Cheolwoo Jo ; Soo-Geon Wang ; Byung-Gon Yang ; Hyung-Soon Kim ; Tao Li

In this paper we classify pathological voices versus normal ones using two parameters, the spectral slope and the harmonics-to-noise ratio (HNR), together with an artificial neural network (ANN). Voice data from normal speakers and patients were collected and divided into three categories: normal, relatively less noisy pathological, and severely noisy pathological. The spectral slope and HNR were computed and used first to separate the severely noisy pathological voices, which contain the most noise. An artificial neural network with common numerical parameters as inputs was then used to discriminate the remaining data into the normal and relatively less noisy pathological categories. The classification results were evaluated by comparing the distributions of the spectral slope and HNR across all the data and by analyzing the classification rates for the normal and relatively less noisy pathological voices.
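
The HNR parameter can be approximated from the peak of the normalized autocorrelation within the expected pitch-lag range; this is a common textbook estimator, not necessarily the exact one used in the paper:

```python
import math

def autocorr_hnr(signal, min_lag, max_lag):
    """Estimate a harmonics-to-noise ratio (dB) from the peak of the
    normalized autocorrelation: HNR = 10*log10(r / (1 - r)), where r is
    the fraction of energy explained by the periodic component."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        a, b = x[: n - lag], x[lag:]
        ea = sum(v * v for v in a)
        eb = sum(v * v for v in b)
        r = sum(p * q for p, q in zip(a, b)) / math.sqrt(ea * eb)
        best = max(best, r)
    best = min(best, 0.999)  # guard against log of a zero noise power
    return 10.0 * math.log10(best / (1.0 - best))

# A clean 100 Hz periodic signal at 8 kHz sampling gives a high HNR
fs = 8000
clean = [math.sin(2 * math.pi * 100 * i / fs) for i in range(800)]
print(autocorr_hnr(clean, 40, 120) > 20)  # True
```

Noisy pathological voices lower the autocorrelation peak and hence the HNR, which is what makes the measure useful for the severity split described above.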

#3 A robust glottal source model estimation technique

Authors: Qiang Fu ; Peter Murphy

This paper describes a robust glottal source estimation method based on a joint source-filter separation technique. In this method, the glottal flow derivative is modelled by the Liljencrants-Fant (LF) model and the vocal tract is described by a time-varying ARX model. Since the joint estimation problem is a multi-parameter nonlinear optimization, we separate the procedure into two passes. The first pass initializes the glottal source and vocal tract models, providing robust initial parameters for the subsequent joint optimization. The joint estimation, implemented with a trust-region descent optimization algorithm, determines the accuracy of the model estimation. Experiments with synthetic and real voices show that the proposed method estimates glottal source parameters robustly and with a considerable degree of accuracy.
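
For reference, the open-phase branch of the LF glottal-flow-derivative model has a simple closed form; the parameter values below are illustrative, not fitted to any voice:

```python
import math

def lf_open_phase(t, E0, a, wg):
    """Open-phase branch of the LF glottal flow derivative:
    e(t) = E0 * exp(a*t) * sin(wg*t), valid for 0 <= t <= te."""
    return E0 * math.exp(a * t) * math.sin(wg * t)

# Sample one open phase; te, E0 and a are illustrative values only.
te, E0, a = 0.006, 1.0, 300.0
wg = math.pi / te  # half a sine period over the open phase
pulse = [lf_open_phase(i * te / 100, E0, a, wg) for i in range(101)]
print(round(pulse[0], 6), round(pulse[100], 6))  # 0.0 0.0
```

The full LF model adds an exponential return phase after te and couples its parameters through continuity and area constraints, which is what makes the joint source-filter estimation a nonlinear multi-parameter problem.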

#4 F0 and formant frequency distribution of dysarthric speech - a comparative study

Authors: Hiroki Mori ; Yasunori Kobayashi ; Hideki Kasuya ; Hajime Hirose ; Noriko Kobayashi

We are investigating acoustical analysis of dysarthric speech, which appears as a symptom of neurological disease, in order to elucidate its physiological and acoustical mechanisms and to develop aids for diagnosis and training. In this report, acoustical characteristics of various kinds of dysarthria are measured. As a result, shrinking of both the F0 range and the vowel space is observed in dysarthric speech. A comparison of the F0 range and the vowel formant frequencies also suggests that the speech effort required to produce a wider F0 range can influence vowel quality.

#5 Procedure "senza vibrato": a key component for morphing singing

Authors: Hideki Kawahara ; Yumi Hirachi ; Masanori Morise ; Hideki Banno

A procedure to remove vibrato from the singing voice is proposed to enable auditory morphing between musical performances played under different conditions. Analyses of singing samples in the RWC Music Database using STRAIGHT, a speech analysis, modification and synthesis system, provided the information necessary to implement "senza vibrato," the procedure that removes vibrato. A preliminary subjective evaluation of artificially adding and removing vibrato indicated that the proposed procedure effectively controls perceived vibrato while preserving the naturalness of the original singing.

#6 Thyroplastic medialisation in unilateral vocal fold paralysis: assessing voice quality recovery

Authors: Claudia Manfredi ; Giorgio Peretti ; Laura Magnoni ; Fabrizio Dori ; Ernesto Iadanza

Medialization thyroplasty and endoscopic intracordal infusion of fat or heterologous materials are the treatments of choice for glottic incompetence of both neurological and cicatricial origin. Functional evaluation after thyroplastic medialisation is often based on several approaches in order to assess the effectiveness of the adopted technique. The most common analysis methods are videolaryngostroboscopy (VLS) for the evaluation of morphological aspects, the GRBAS scale and the Voice Handicap Index (VHI) for perceptual and subjective voice analysis, and MDVP®, which provides objective acoustic parameters. First results are presented here, obtained both with these approaches and with a new voice analysis tool based on robust estimators for tracking the fundamental frequency F0, noise and formants. New indexes are also proposed to easily quantify voice quality recovery. The proposed approach was successfully applied to patients who underwent thyroplastic medialisation, and is suited to integration with the MDVP features.

#7 Voice enhancement of male speakers with laryngeal neoplasm

Authors: Gernot Kubin ; Martin Hagmueller

In this paper an approach is presented that aims to enhance disordered male voices, targeting high-pitched voices with a severe degree of hoarseness. The goal of the study is to determine whether pitch modification combined with periodicity enhancement can improve the perceived quality of a disordered speech utterance. Signal manipulation is done pitch-synchronously, so pitch marks first have to be detected. Period enhancement is then performed, followed by a PSOLA-based pitch modification step. Finally, period enhancement is performed once more on the pitch-lowered utterance. Perceptual evaluations were performed by both professional speech and language pathologists and naive listeners to rate the perceived enhancement of the voice. Results show that the modified voice has reduced breathiness, whereas roughness seems not to be influenced by the processing. The most significant result from the naive listener test is a reduction in perceived speaking effort.

#8 A comparison of the perturbation analysis between PRAAT and Computerized Speech Lab

Authors: Jong Min Choi ; Myung-Whun Sung ; Kwang Suk Park ; Jeong-Hun Hah

Programs for the analysis of pathological voice data have been presented over the last few decades. Computerized Speech Lab (CSL) has been recognized as a standard in the field of pathological voice analysis. PRAAT is a newer open program that is being improved almost weekly. CSL provides the "Voice Disorders Database" produced by the Massachusetts Eye and Ear Infirmary. In this paper, we use the PRAAT program for the analysis of voice data and compare its results with those of CSL. We focus on the perturbation analysis of frequency, using several parameters such as jitter and the standard deviation of the fundamental frequency.
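
Among the perturbation parameters compared, local jitter has a standard definition shared by PRAAT- and MDVP-style tools; a minimal sketch:

```python
def local_jitter_percent(periods):
    """Jitter (local): mean absolute difference of consecutive pitch
    periods divided by the mean period, expressed in percent."""
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# Perfectly regular periods give 0% jitter; a perturbed train does not
print(local_jitter_percent([10.0, 10.0, 10.0, 10.0]))           # 0.0
print(round(local_jitter_percent([10.0, 10.1, 9.9, 10.0]), 2))  # 1.33
```

Tool-to-tool differences of the kind the paper studies often come not from this formula but from how the period boundaries themselves are detected.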

#9 Reconstruction filter design for bone-conducted speech

Authors: Toshiki Tamiya ; Tetsuya Shimamura

Bone-conducted speech is of low intelligibility, but its quality is not affected by noise. In this paper, we take these properties into account and design a digital filter to reconstruct the quality of the bone-conducted speech signal obtained from a speaker. The reconstruction filter design method is derived from a model assumption of pronunciation. Experimental results show that the reconstructed speech signal has better quality than the bone-conducted speech signal.

#10 Frequency warped ARMA analysis of the closed and the open phase of voiced speech

Authors: Pedro J. Quintana-Morales ; Juan L. Navarro-Mesa

We propose a frequency-warped version of a pole-zero analysis computed over several periods of voiced speech. We approach the estimation of the coefficients associated with the poles and zeros by minimizing a cost function based on the 'warped' reconstruction error. The results reinforce our previous work, allowing us to extend the initial equations by introducing the concept of auditory perception into the formulation. This facilitates an improvement in the analysis of the phases associated with consecutive periods. With the experiments we address several objectives: first, to evaluate the importance of the time extension due to warping; second, to obtain an optimum warping factor from a reconstruction-error point of view; third, to study the behaviour of our analysis as a function of the period length; and fourth, to study the distribution of the error in frequency. Our results indicate that the use of warped techniques is beneficial in speech analysis.

#11 Zeros of z-transform (ZZT) decomposition of speech for source-tract separation

Authors: Boris Doval ; Baris Bozkurt ; Christophe D'Alessandro ; Thierry Dutoit

This study proposes a new spectral decomposition method for source-tract separation. It is based on a new spectral representation called the Zeros of the Z-Transform (ZZT), which is an all-zero representation of the z-transform of the signal. We show that separate patterns exist in the ZZT representations of speech signals for the glottal flow and the vocal tract contributions. The ZZT decomposition simply groups the zeros into two sets according to their location in the z-plane. This type of decomposition separates the glottal flow contribution (without the return phase) from the vocal tract contribution in the z domain.
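
The ZZT idea can be shown on a toy 3-sample frame whose two zeros have a closed form; real speech frames yield hundreds of complex zeros and need a numerical polynomial root finder. In the decomposition, zeros outside the unit circle go with the maximum-phase (glottal open-phase) part and zeros inside with the minimum-phase (vocal tract) part:

```python
import math

def zzt_split_quadratic(x):
    """ZZT of a 3-sample frame x = [x0, x1, x2]: its z-transform is
    x0 + x1*z^-1 + x2*z^-2 = x0 * z^-2 * (z^2 + (x1/x0)*z + (x2/x0)),
    so the ZZT is the pair of roots of that quadratic (real roots only
    here, for illustration), grouped by unit-circle location."""
    b, c = x[1] / x[0], x[2] / x[0]
    d = math.sqrt(b * b - 4.0 * c)
    roots = [(-b + d) / 2.0, (-b - d) / 2.0]
    inside = [z for z in roots if abs(z) < 1.0]
    outside = [z for z in roots if abs(z) >= 1.0]
    return inside, outside

# Frame built by convolving (1 - 0.5 z^-1) with (1 - 2 z^-1):
inside, outside = zzt_split_quadratic([1.0, -2.5, 1.0])
print(inside, outside)  # [0.5] [2.0]
```

Grouping the zeros and multiplying each group back out yields the two separated spectra described in the abstract.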

#12 Use of neural network mapping and extended Kalman filter to recover vocal tract resonances from the MFCC parameters of speech

Authors: Li Deng ; Roberto Togneri

In this paper, we present a state-space formulation of a neural-network-based hidden dynamic model of speech whose parameters are trained using an approximate EM algorithm. The training makes use of the results of an off-the-shelf formant tracker (during the vowel segments) to simplify the complex sufficient statistics that would be required in the exact EM algorithm. The trained model, consisting of the state equation for the target-directed vocal tract resonance (VTR) dynamics on all classes of speech sounds (including consonant closure) and the observation equation for mapping from the VTR to the acoustic measurement, is then used to recover the unobserved VTR with an extended Kalman filter. The results demonstrate accurate estimation of the VTRs, especially during rapid consonant-vowel or vowel-consonant transitions and during consonant closure, when the acoustic measurement alone provides weak or no information to infer the VTR values.

#13 Graphical model approach to pitch tracking

Authors: Xiao Li ; Jonathan Malkin ; Jeff Bilmes

Many pitch trackers based on dynamic programming require meticulous design of local cost and transition cost functions. The forms of these functions are often empirically determined and their parameters are tuned accordingly. Parameter tuning usually requires great effort without a guarantee of optimal performance. This work presents a graphical model framework to automatically optimize pitch tracking parameters in the maximum likelihood sense. Therein, probabilistic dependencies between pitch, pitch transition and acoustical observations are expressed using the language of graphical models, and probabilistic inference is accomplished using the Graphical Model Toolkit (GMTK). Experiments show that this framework not only expedites the design of a pitch tracker, but also yields remarkably good performance for both pitch estimation and voicing decision.
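
The dynamic-programming baseline described above can be sketched as a Viterbi search over pitch candidates; the local and transition costs below are arbitrary stand-ins for the hand-tuned functions that the graphical model framework learns instead:

```python
def viterbi_pitch(local_cost, trans_cost):
    """Classic DP pitch tracking: pick the candidate sequence minimizing
    summed local costs plus transition costs between consecutive frames.
    local_cost[t][j] scores candidate j at frame t; trans_cost(a, b)
    scores moving from candidate a to candidate b."""
    T, J = len(local_cost), len(local_cost[0])
    cost = [local_cost[0][:]]
    back = []
    for t in range(1, T):
        row, ptr = [], []
        for j in range(J):
            best = min(range(J), key=lambda i: cost[-1][i] + trans_cost(i, j))
            row.append(cost[-1][best] + trans_cost(best, j) + local_cost[t][j])
            ptr.append(best)
        cost.append(row)
        back.append(ptr)
    path = [min(range(J), key=lambda j: cost[-1][j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Three frames, candidates {0: low pitch, 1: high pitch}; the middle
# frame prefers the high candidate, but the smoothness penalty wins.
lc = [[0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
path = viterbi_pitch(lc, lambda a, b: 3.0 * abs(a - b))
print(path)  # [0, 0, 0]
```

The paper's contribution is to replace hand-tuning of these two cost functions with maximum-likelihood training of the equivalent probabilistic quantities in GMTK.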

#14 A new multicomponent AM-FM demodulation with predicted frequency boundaries and its application to formant estimation

Authors: Bo Xu ; Jianhua Tao ; Yongguo Kang

In this paper, a method using dynamic programming to predict frequency boundaries is proposed for AM-FM demodulation of speech signals. The energy separation algorithm (ESA) was developed to track the energy needed by a source to produce the speech signal, and it provides an efficient way to separate the output energy into amplitude modulation and frequency modulation components. For multicomponent AM-FM signals such as speech, a bank of bandpass filters, or a set of individual bandpass filters whose center frequencies and bandwidths are commonly chosen empirically, is needed to obtain monocomponent signals. Our experimental results show that bandpass filters with predicted frequency boundaries, rather than empirically chosen ones, are more effective for AM-FM demodulation. Formant estimation based on this demodulation method also proves efficient, and no separate formant tracking algorithm is needed in the estimation procedure.
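
The energy separation algorithm builds on the Teager-Kaiser energy operator; a simplified DESA-style sketch for a locally sinusoidal monocomponent signal (constant amplitude and frequency, so the averaged form of DESA-1 is unnecessary):

```python
import math

def teager(x, n):
    """Teager-Kaiser energy operator: Psi[x](n) = x[n]^2 - x[n-1]*x[n+1].
    For x[n] = A*cos(Omega*n) this equals A^2 * sin(Omega)^2."""
    return x[n] * x[n] - x[n - 1] * x[n + 1]

def esa_freq_amp(x, n):
    """Energy-separation estimates at sample n: applying Psi to the
    signal and to its backward difference y[n] = x[n] - x[n-1] gives
    Psi(y)/Psi(x) = 2*(1 - cos(Omega)), from which the digital
    frequency Omega and amplitude A are recovered."""
    y = [x[i] - x[i - 1] for i in range(1, len(x))]
    px = teager(x, n)
    py = teager(y, n)  # y is offset by one sample; Psi is constant here
    omega = math.acos(1.0 - py / (2.0 * px))
    amp = math.sqrt(px) / math.sin(omega)
    return omega, amp

# Pure cosine at 0.3 rad/sample with amplitude 2
x = [2.0 * math.cos(0.3 * i) for i in range(50)]
omega, amp = esa_freq_amp(x, 10)
print(round(omega, 3), round(amp, 3))  # 0.3 2.0
```

Speech is multicomponent, which is why the bandpass filtering step (with the predicted frequency boundaries proposed in the paper) must isolate one component before such a demodulator is applied.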

#15 A concurrent curve strategy for formant tracking

Author: Yves Laprie

Although automatic formant tracking has a wide range of potential applications, it is still an open problem. We previously proposed the use of active curves that deform under the influence of the spectrogram energy. Each formant was tracked independently, and a complex strategy was required to guarantee the overall consistency of the formant tracks. This paper describes how the interdependency between formants can be incorporated directly into the deformation of the formant tracks. Iterative processes attached to each formant are interlaced. We experimented with two strategies. The first consists in partitioning the spectrogram into exclusive regions, each affiliated with a given formant. The second consists in adding a repulsion force between formants that prevents formant tracks from merging. It turns out that the second strategy is more robust and does not require a complex control strategy.
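
The repulsion idea can be sketched with a hypothetical pairwise force; the exponential form and the constants below are invented for illustration and are not the paper's actual force:

```python
import math

def repel(tracks, strength, radius):
    """One relaxation step of a hypothetical pairwise repulsion between
    formant frequencies (Hz) at a single time frame: each track is
    pushed away from nearby tracks, so two tracks cannot merge."""
    out = []
    for i, fi in enumerate(tracks):
        push = 0.0
        for j, fj in enumerate(tracks):
            if i != j:
                d = fi - fj
                push += math.copysign(strength * math.exp(-abs(d) / radius), d)
        out.append(fi + push)
    return out

# Two tracks drifting together get pushed apart
near = repel([1000.0, 1100.0], 50.0, 200.0)
print(near[1] - near[0] > 100.0)  # True: the gap widened
```

In the actual tracker such a force would enter the iterative curve deformation alongside the spectrogram-energy term, so repulsion only dominates when tracks come dangerously close.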

#16 A formant tracking LP model for speech processing

Authors: Qin Yan ; Esfandiar Zavarehei ; Saeed Vaseghi ; Dimitrios Rentzos

This paper investigates the modeling and estimation of spectral parameters at the formants of noisy speech in the presence of car and train noise. Formant estimation using two-dimensional hidden Markov models (2D-HMMs) is reviewed and employed to study the influence of noise on observations of formants. The first set of experimental results shows the influence of car and train noise on the distribution and the estimates of the formant trajectories. Due to the shapes of the spectra of speech and car/train noise, the first formant is most affected by noise and the last formant is least affected. The effects of including formant features in speech recognition at different SNRs are presented. It is shown that formant features provide better performance at low SNRs than MFCC features. Finally, for robust estimation of noisy speech, a formant tracking method based on a combination of LP spectral subtraction and Kalman filtering is presented. Average formant tracking errors at different SNRs are computed, and the results show that after noise reduction the tracking errors of the first formant are reduced by 60%. The de-noised formant-tracking LP models can be used for recognition and/or enhancement of noisy speech.

#17 Application of long-term filtering to formant estimation

Author: Hong You

We propose a formant analysis algorithm that works well for high-pitched speakers. The algorithm reduces the influence of the pitch frequency on formant analysis. A pitch-synchronized long-term filter is optimized and applied to speech signals before LPC analysis. A weighted LPC analysis method is proposed to compute the autoregressive model parameters and hence the formants. Processing of synthetic and natural speech shows that the spectra of the long-term filtered residue exhibit the formant structure better than those of the speech signal, especially when the pitch frequency is high. A comparison between formant tracking by the proposed algorithm and by ESPS waves+ shows promising results.
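
The effect of long-term filtering can be seen with a one-tap long-term predictor; the paper optimizes the filter pitch-synchronously, whereas the lag and gain here are fixed by hand:

```python
import math

def long_term_residue(x, lag, gain):
    """One-tap long-term (pitch) filter: e[n] = x[n] - gain * x[n - lag].
    With gain near 1 and lag near the pitch period, the harmonic fine
    structure cancels, leaving a residue that exposes the envelope."""
    return [x[n] - gain * x[n - lag] for n in range(lag, len(x))]

fs, f0 = 8000, 125                      # pitch period = 64 samples exactly
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(320)]
e = long_term_residue(x, 64, 1.0)
print(max(abs(v) for v in e) < 1e-9)    # True: a periodic signal cancels
```

For real speech the cancellation is partial, so the residue keeps the formant-shaped envelope while the strong pitch harmonics, which bias LPC for high-pitched speakers, are attenuated.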

#18 A method for glottal formant frequency estimation

Authors: Baris Bozkurt ; Thierry Dutoit ; Boris Doval ; Christophe D'Alessandro

This study presents a method for estimating the glottal formant frequency (Fg) from speech signals. Our method is based on the zeros of the z-transform decomposition of speech spectra into two spectra: a glottal-flow-dominated spectrum and a vocal-tract-dominated spectrum. Peak picking is performed on the amplitude spectrum of the glottal-flow-dominated part. The algorithm is tested on synthetic speech and is shown to be effective, especially when the glottal formant and the first vocal tract formant are not too close. In addition, tests on a real speech example are presented, where open quotient estimates from EGG signals are used as a reference and correlated with the glottal formant frequency estimates.

#19 Improved differential phase spectrum processing for formant tracking

Authors: Baris Bozkurt ; Thierry Dutoit ; Boris Doval ; Christophe D'Alessandro

This study presents an improved version of our previously introduced formant tracking algorithm. The algorithm is based on processing the negative derivative of the argument of the chirp-z transform (termed the differential phase spectrum) of a given speech signal. No modeling is included in the procedure, only peak picking on the differential phase spectrum. We discuss the effect of the roots of the z-transform on the differential phase spectrum and the need to ensure that all zeros are at some distance from the circle on which the chirp-z transform is computed. For that, we include an additional zero-decomposition step in our previously presented algorithm to improve its robustness. The final version of the algorithm is tested on synthetic and real speech signals and compared with two other formant tracking systems.

#20 MAP prediction of pitch from MFCC vectors for speech reconstruction

Authors: Xu Shao ; Ben P. Milner

This work proposes a method of predicting pitch and voicing from mel-frequency cepstral coefficient (MFCC) vectors. Two maximum a posteriori (MAP) methods are considered. The first models the joint distribution of the MFCC vector and pitch using a Gaussian mixture model (GMM), while the second also models the temporal correlation of the pitch contour using a combined hidden Markov model (HMM)-GMM framework. Monophone-based HMMs are connected together in the form of an unconstrained monophone grammar, which enables pitch to be predicted from unconstrained speech input. Evaluation on 130,000 MFCC vectors reveals a voicing classification accuracy of over 92% and an RMS pitch error of 10 Hz. The predicted pitch contour is also applied to MFCC-based speech reconstruction, with the resultant speech almost indistinguishable from that reconstructed using a reference pitch.
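
The first MAP method can be sketched in one dimension: with a joint (feature, pitch) GMM, the pitch estimate given a feature value is the responsibility-weighted sum of per-component conditional means. This scalar toy stands in for the MFCC-vector case, and the mixture below is invented for illustration:

```python
import math

def gauss(x, mu, var):
    """Scalar Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pitch_estimate(c, components):
    """Pitch prediction from a joint (feature, pitch) GMM.  Each
    component is (weight, mu_c, var_c, mu_f, cov_cf).  The estimate
    averages the per-component conditional means
    E[f | c] = mu_f + (cov_cf / var_c) * (c - mu_c),
    weighted by each component's posterior responsibility given c."""
    resp = [w * gauss(c, mu_c, var_c)
            for (w, mu_c, var_c, mu_f, cov_cf) in components]
    total = sum(resp)
    est = 0.0
    for r, (w, mu_c, var_c, mu_f, cov_cf) in zip(resp, components):
        est += (r / total) * (mu_f + (cov_cf / var_c) * (c - mu_c))
    return est

# Two invented components with zero feature-pitch covariance:
# a "low voice" around c = -1 -> 100 Hz, a "high voice" around c = +1 -> 200 Hz
comps = [(0.5, -1.0, 0.1, 100.0, 0.0), (0.5, 1.0, 0.1, 200.0, 0.0)]
print(round(gmm_pitch_estimate(-1.0, comps), 1))  # 100.0
```

The HMM-GMM variant in the paper additionally smooths such frame-wise estimates along the pitch contour through the monophone grammar.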

#21 New harmonicity measures for pitch estimation and voice activity detection

Authors: An-Tze Yu ; Hsiao-Chuan Wang

Harmonic structure can be easily recognized in the time-frequency representation of speech signals, even in diverse environments. Harmonicity is a measure of the completeness of the harmonic structure. This paper extends the use of the conventional harmonicity measure to the tasks of pitch estimation and voice activity detection. A set of hierarchical harmonicities, including grid, temporal, spectral and segmental harmonicities, is derived for this purpose. A series of experiments is conducted to show the effectiveness of using harmonicities in speech processing.

#22 Multi-pitch trajectory estimation of concurrent speech based on harmonic GMM and nonlinear Kalman filtering

Authors: Takuya Nishimoto ; Shigeki Sagayama ; Hirokazu Kameoka

This paper describes a multi-pitch tracking algorithm for single-channel recordings of simultaneous speech from multiple speakers. The algorithm selectively carries out one of two alternative processes at each frame: a frame-independent process and a frame-dependent process. The former, which we have previously proposed, gives good estimates of the number of speakers and their F0s from a single frame. The latter, the main topic of this paper, recursively tracks the F0s using nonlinear Kalman filtering. We tested our algorithm on simultaneous speech data and showed higher performance than when only the frame-independent process was used.

#23 Automatic pitch marking and reconstruction of glottal closure instants from noisy and deformed electro-glotto-graph signals

Authors: Attila Ferencz ; Jeongsu Kim ; Yong-Beom Lee ; Jae-Won Lee

Pitch tracking and pitch marking (PM) are two important speech signal analysis techniques for several applications. The accuracy of both is critical for generating smooth synthesized speech, for example by controlling the pitch and duration of voiced speech in a text-to-speech (TTS) system. In this paper, we present a novel hybrid approach that combines electro-glotto-graph (EGG)-based PM and speech-signal-based PM in a single framework to obtain a more reliable and automatic PM technique. Experimental results show that the PM performance of the suggested method is excellent, determining glottal closure instants (GCIs) precisely even for noisy EGG signals.

#24 On the use of a weighted autocorrelation based fundamental frequency estimation for a multidimensional speech input

Authors: Federico Flego ; Luca Armani ; Maurizio Omologo

Computing the fundamental frequency F0 accurately is a known and still partially unsolved problem, especially for noisy speech input. In this work, a distant-talking scenario is addressed, where a distributed microphone network provides multi-channel input sequences to be processed for speaker modeling purposes. In this context, one may process each channel independently and then apply a majority vote or another fusion method. Alternatively, the redundancy across the channels can be exploited by processing the different signals jointly to obtain a more reliable and robust F0 estimate. This paper investigates the use of a multi-channel version of a Weighted Autocorrelation (WAUTOC)-based F0 estimation technique. A post-processing corrective step is introduced to improve the resulting F0 accuracy. Experiments conducted on a real database show the advantages and robustness of the proposed method in extracting the fundamental frequency regardless of microphone and talker position and head orientation.
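
The single-channel WAUTOC idea that the multi-channel method extends divides the autocorrelation by the average magnitude difference function (AMDF), which sharpens the peak at the true period; a minimal sketch:

```python
import math

def wautoc_period(x, min_lag, max_lag, k=1e-3):
    """Weighted autocorrelation period estimate: ACF(lag) / (AMDF(lag) + k).
    The AMDF is near zero at the true period, so dividing by it
    emphasizes the correct peak and suppresses spurious ones."""
    best_lag, best_val = min_lag, -1e30
    for lag in range(min_lag, max_lag + 1):
        n = len(x) - lag
        acf = sum(x[i] * x[i + lag] for i in range(n)) / n
        amdf = sum(abs(x[i] - x[i + lag]) for i in range(n)) / n
        val = acf / (amdf + k)
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

fs, f0 = 8000, 125
x = [math.sin(2 * math.pi * f0 * i / fs) for i in range(400)]
print(wautoc_period(x, 30, 120))  # 64 samples -> 8000 / 64 = 125 Hz
```

The multi-channel extension in the paper pools such evidence across the microphone network instead of picking the peak on one channel alone.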

#25 A minimum mean squared error estimator for single channel speaker separation

Authors: Aarthi M. Reddy ; Bhiksha Raj

The problem of separating the signals of multiple speakers from a single mixed recording has received considerable attention in recent times. Most current techniques are based on the principle of masking: to separate out the signal for a given speaker, frequency components that are not believed to belong to that speaker are suppressed. The signals for the various speakers are then reconstructed from the partial spectral information that remains. In this paper we present a different kind of technique, one that attempts to estimate all spectral components of the desired speaker. Separated signals are derived from the complete spectral descriptions so obtained. Experiments show that this method yields reconstruction superior to masking-based reconstruction.
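
The masking principle that the paper argues against can be stated in a few lines; the binary mask below is a generic illustration, not the specific systems compared in the paper:

```python
def binary_mask(mix_mag, est_a, est_b):
    """Masking-based separation: each frequency bin of the mixture
    magnitude spectrum is assigned wholly to whichever speaker's
    estimated spectrum is larger there; the other speaker gets zero in
    that bin, i.e. only partial spectral information survives."""
    out_a = [m if a >= b else 0.0 for m, a, b in zip(mix_mag, est_a, est_b)]
    out_b = [m if a < b else 0.0 for m, a, b in zip(mix_mag, est_a, est_b)]
    return out_a, out_b

mix = [3.0, 4.0, 2.0, 5.0]
a_hat = [3.0, 1.0, 2.0, 0.5]  # speaker A believed dominant in bins 0 and 2
b_hat = [0.5, 4.0, 1.0, 5.0]
masks = binary_mask(mix, a_hat, b_hat)
print(masks)  # ([3.0, 0.0, 2.0, 0.0], [0.0, 4.0, 0.0, 5.0])
```

The MMSE estimator proposed in the paper instead fills in every bin with an estimate of the desired speaker's spectral component, avoiding the zeroed-out gaps that masking leaves behind.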